Using a recursive reinforcement learning model to determine an agent action

ABSTRACT

According to examples, an apparatus may include a processor and a memory on which is stored machine readable instructions that may cause the processor to access data about an environment of an agent, identify an actor in the environment, and access candidate models, in which each of the candidate models may predict a certain action of the identified actor. The instructions may also cause the processor to apply a selected candidate model of the accessed candidate models on the accessed data to determine a predicted action of the identified actor and may implement a recursive reinforcement learning model using the predicted action of the identified actor to determine an action that the agent is to perform. The instructions may further cause the processor to cause the agent to perform the determined action.

BACKGROUND

Autonomous devices may use or include computing systems to determine actions that the autonomous devices may perform. Some of the actions may include, for instance, navigating from one location to another, actuating a manipulator, maintaining a position of a vehicle within a lane along a road, and/or the like. In many instances, the computing systems may determine the actions such that the autonomous devices are prevented from performing actions that deviate from a specified set of operations.

BRIEF DESCRIPTION OF DRAWINGS

Features of the present disclosure are illustrated by way of example and not limited in the following figure(s), in which like numerals indicate like elements, in which:

FIG. 1 shows a block diagram of a system that may include an apparatus, in which the apparatus may determine an action that an agent is to perform based on a predicted action of an actor in an environment of the agent, in accordance with an embodiment of the present disclosure;

FIG. 2 shows a block diagram of the apparatus depicted in FIG. 1, in accordance with an embodiment of the present disclosure;

FIG. 3A shows a diagram of a first order reinforcement learning model for the actor shown in FIG. 1, in accordance with an embodiment of the present disclosure;

FIG. 3B shows a diagram of a (recursive) second order reinforcement learning model for the actor shown in FIG. 1, in accordance with an embodiment of the present disclosure;

FIG. 3C shows a diagram of a (recursive) third order reinforcement learning model for the agent shown in FIG. 1, in accordance with an embodiment of the present disclosure;

FIGS. 4 and 5, respectively, depict flow diagrams of methods for applying a third order reinforcement learning model to determine an action that an agent is to perform, in accordance with embodiments of the present disclosure;

FIG. 6 depicts a block diagram of a computer-readable medium that may have stored thereon computer-readable instructions for determining an action that an agent is to perform using predicted actions of a first actor and a second actor, in accordance with an embodiment of the present disclosure; and

FIG. 7 depicts a diagram of a reinforcement learning model according to an embodiment of the present disclosure.

DETAILED DESCRIPTION

For simplicity and illustrative purposes, the principles of the present disclosure are described by referring mainly to embodiments and examples thereof. In the following description, numerous specific details are set forth in order to provide an understanding of the embodiments and examples. It will be apparent, however, to one of ordinary skill in the art, that the embodiments and examples may be practiced without limitation to these specific details. In some instances, well known methods and/or structures have not been described in detail so as not to unnecessarily obscure the description of the embodiments and examples. Furthermore, the embodiments and examples may be used together in various combinations.

Throughout the present disclosure, the terms “a” and “an” are intended to denote at least one of a particular element. As used herein, the term “includes” means includes but not limited to; the term “including” means including but not limited to. The term “based on” means based at least in part on.

Disclosed herein are apparatuses, methods, and computer-readable media in which a processor may implement a recursive reinforcement learning model to determine an action that an agent is to perform. Particularly, the processor may determine a predicted action of an actor identified in an environment of the agent and may implement the recursive reinforcement learning model to determine the action that the agent is to perform. In some examples, the processor may implement a machine learning model, e.g., a first order model, on the actor and the environment to determine the predicted action of the actor. The machine learning model may be a reinforcement learning model, and the processor may determine the predicted action to be an action that may maximize a reward for the actor and/or minimize a penalty for the actor according to a reward/penalty policy for the actor. In addition, the processor may determine the action that the agent is to perform using the predicted action of the actor as a factor in the recursive reinforcement learning model. For instance, the processor may determine the action that the agent is to perform as an action that may maximize both the reward for the actor and a reward for the agent according to a reward policy and/or may minimize both a penalty for the actor and a penalty for the agent according to a penalty policy.

In some examples, the recursive reinforcement learning model may be a third order model. In these examples, the processor may determine a predicted action of a second actor using a first order reinforcement learning model and may use the predicted action of the second actor as a factor in a second order reinforcement learning model of a first actor. For instance, the processor may use the predicted action that results in a maximized reward for the second actor in determining a predicted action of the first actor. In addition, the processor may use the predicted action of the first actor in a third order model to determine the action that the agent is to perform.

According to examples, a server may generate the models that the processor may use to determine the predicted actions of the actors and the agents and may communicate the models to the processor. In addition, the processor may upload data corresponding to the predicted actions to the server, and the server may update the models based on the uploaded data. As a result, the server may continuously update the models to thereby improve their accuracies.

Through implementation of the features of the present disclosure, a processor may accurately determine actions that an agent is to perform and may cause the agent to perform the determined actions. That is, for instance, the processor may determine actions that an agent is to perform that may prevent the agent from operating in a dangerous and/or unintended manner with respect to objects in an environment in which the agent is operating. As a result, the processor may cause the agent to perform actions that may be safe to the agent as well as to the actors in the environment of the agent. The processor may accurately determine the actions through implementation of recursive reinforcement learning models on actors in the environment as well as the agent. In addition, the processor may determine actions for the agent that may enable the agent to operate in an energy efficient manner, e.g., by determining actions that may reduce wasted movements of the agent as well as other actors in the environment.

Reference is first made to FIGS. 1 and 2. FIG. 1 shows a block diagram of a system 101 that may include an apparatus 100, in which the apparatus 100 may determine an action that an agent 110 is to perform based on a predicted action of an actor 120 in an environment 130 of the agent 110, in accordance with an embodiment of the present disclosure. FIG. 2 shows a block diagram of the apparatus 100 depicted in FIG. 1, in accordance with an embodiment of the present disclosure. It should be understood that the apparatus 100 depicted in FIGS. 1 and 2 may include additional features and that some of the features described herein may be removed and/or modified without departing from the scope of the apparatus 100.

The apparatus 100 may be a computing device such as a laptop computer, a tablet computer, a smartphone, or the like, and may be separate from the agent 110. In other examples, the apparatus 100 may be a control system of the agent 110 and may thus be integrated with the agent 110. In any of these examples, the apparatus 100 may be mounted on or in the agent 110. In addition, the apparatus 100 may include a processor 102 that may control operations of various components of the apparatus 100 and a memory 104 on which data that the processor 102 may access and/or execute may be stored.

The processor 102 may be a semiconductor-based microprocessor, a central processing unit (CPU), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA), and/or other hardware device. The memory 104, which may also be termed a computer-readable medium, may be, for example, a Random Access Memory (RAM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a storage device, or the like. The memory 104 may be a non-transitory computer-readable storage medium, where the term “non-transitory” does not encompass transitory propagating signals. In any regard, the memory 104 may have stored thereon machine-readable instructions that the processor 102 may execute to control various operations of the apparatus 100 and/or the agent 110.

Although the apparatus 100 is depicted as having a single processor 102, it should be understood that the apparatus 100 may include additional processors and/or cores without departing from a scope of the apparatus 100. In this regard, references to a single processor 102 as well as to a single memory 104 may be understood to additionally or alternatively pertain to multiple processors 102 and multiple memories 104. In addition, or alternatively, the processor 102 and the memory 104 may be integrated into a single component, e.g., an integrated circuit on which both the processor 102 and the memory 104 may be provided.

According to examples, the agent 110 may be a physical device that may autonomously maneuver itself (e.g., under the direction of the processor 102) in the environment 130, e.g., a robot, an automobile, an aircraft, and/or the like. That is, the agent 110 may be an autonomous device that may move in various directions and/or may include components, e.g., arms, graspers, manipulators, and/or the like, in which movements of the agent 110 and/or components of the agent 110 may occur without receipt of user instructions to do so (under the direction of the processor 102). As shown, the agent 110 may include an actuator 112 that may be actuated to move the agent 110 and/or components of the agent 110. The actuator 112 may be a motor or other device that may drive wheels on the agent 110, that may manipulate an arm on the agent 110, that may activate an indicator, and/or the like. In any regard, and as discussed herein, the processor 102 may determine an action that the agent 110 is to perform based on a predicted action of actors 120, 122 in the environment 130 as well as other data in and/or about the environment 130.

By way of example, the agent 110 (along with the processor 102) may be an autonomous vehicle, an autonomous land-based robot, an autonomous flying vehicle, an autonomous water-based robot, and/or the like. In particular examples, the agent 110 may be a self-driving car, a robot in a warehouse, a drone, etc. As discussed herein, the processor 102 of the apparatus 100, which may be integrated with the agent 110, may make decisions as to how the agent 110 may operate, e.g., the actions that the agent 110 may perform based on conditions of the environment 130 in which the agent 110 is operating. The environment 130 may be an area in which the agent 110 may operate and may include various objects with which the agent 110 may interact. By way of example, the environment 130 may be a road along which the agent 110 is travelling and an area around the road, a warehouse in which the agent 110 is operating, a building in which the agent 110 is operating, an aerial space around the agent 110, and/or the like. The environment 130 may additionally be any other area in which the agent 110 may operate or may currently be operating.

The agent 110 may further include a sensor 114 that may track objects in the environment 130. In some examples, the sensor 114 may be a camera or a plurality of cameras that may capture images, e.g., videos, of the environment 130. In addition or in other examples, the sensor 114 may be a radar-based sensor that may track the locations and distances of objects in the environment 130 with respect to the agent 110. For instance, the sensor 114 may be a sonar-based sensor, a radar-based sensor, a light detection and ranging (LIDAR) based sensor, and/or the like.

According to examples, the processor 102 may determine, from the data 116 collected by the sensor 114, objects, e.g., actors 120, 122, along with their respective locations and/or movements. That is, for instance, the processor 102 may implement an object recognition program that may identify objects in the captured images and may also determine the locations and/or distances of the objects with respect to the agent 110. In some examples, the processor 102 may distinguish the objects in the captured images that are moving and/or are able to move from other objects, such as objects that are fixed and/or are not likely to move. In addition, the processor 102 may identify the objects that are moving and/or are able to move as actors 120, 122. In other words, the processor 102 may identify the actors 120, 122 as objects, e.g., entities, that may have intention, e.g., the actor 120, 122 may have an intention to perform an operation, and agency, e.g., the actor 120, 122 may take an action.

Examples of objects that are fixed and/or not likely to move may include, for instance, street signs, trees, rocks, shelves, walls, buildings, or the like. Examples of objects that are moving and/or are able to move, which are referenced herein as the actors 120, 122, may be vehicles, robots, drones, etc., and/or other mobile entities, such as people, animals, scooters, bicycles, etc. In some examples, the processor 102 may determine that an object is moving in instances in which a position of the object changes in multiple images. The processor 102 may also predict the motion of an object, e.g., that the object is predicted to continue to move in a particular direction, following movement of the object outside of the view of the sensor 114.

In some examples, the processor 102 may identify actors 120, 122 as objects that move between frames of images that the sensor 114 may have captured and/or are predicted to move between the frames and/or in future frames. The processor 102 may also identify an object that appears stationary to be an actor 120 from learned data, e.g., the processor 102 may identify an automobile or a person that is currently stationary as an actor 120. The processor 102 may dynamically identify a novel object, e.g., an object that the processor 102 may not have seen before, as an actor 120 based on the behavior of the object. For instance, the processor 102 may distinguish between inanimate objects and objects that may be intelligently controlled based on the manners in which the objects may be moving. An inanimate object may move along a straight line, e.g., a baseball, whereas an intelligently controlled object, e.g., a bird, may perform independent actions such as stopping and then moving, turning, etc.
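By way of illustration only, the frame-to-frame motion test described above might be sketched as follows. This is a minimal sketch and not part of the disclosed apparatus: the Detection record, the KNOWN_MOBILE_TYPES list, and the motion threshold are assumed names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Detection:
    """Hypothetical per-frame detection record; field names are illustrative."""
    object_id: str
    object_type: str                 # e.g., "car", "person", "tree"
    position: Tuple[float, float]    # (x, y) in image or world coordinates

# Object types learned to be mobile even while currently stationary (assumed list).
KNOWN_MOBILE_TYPES = {"car", "person", "bicycle", "animal"}

def identify_actors(prev_frame, curr_frame, motion_threshold=0.5):
    """Flag an object as an actor if it moved between frames or belongs
    to a type known to be capable of moving."""
    prev_by_id = {d.object_id: d for d in prev_frame}
    actors = []
    for det in curr_frame:
        moved = False
        if det.object_id in prev_by_id:
            px, py = prev_by_id[det.object_id].position
            cx, cy = det.position
            moved = ((cx - px) ** 2 + (cy - py) ** 2) ** 0.5 > motion_threshold
        if moved or det.object_type in KNOWN_MOBILE_TYPES:
            actors.append(det)
    return actors
```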

The agent 110 may also include a tracking mechanism 117 that may include components for tracking the spatial location of the agent 110, e.g., within the environment 130. For instance, the tracking mechanism 117 may include an accelerometer that may detect accelerations of the agent 110 and a gyroscope that may detect rotational movements of the agent 110. The tracking mechanism 117 may also include global positioning system (GPS) components that may track a geographic position of the agent 110. In examples, the processor 102 may determine the spatial location of the agent 110, the direction at which the agent 110 is facing, as well as the direction in which the agent 110 is moving from data identified by the tracking mechanism 117.

Although not shown in FIG. 1, the agent 110 may include other components such as a microphone to capture audio, a speaker to output audio, a receptacle to receive a battery, a wireless transceiver, and/or the like. The agent 110 may also include a display, lights, and/or other types of visual indicators.

In addition, or in other examples, the agent 110 may be a virtual agent, for example, an artificial intelligence program that may operate in a virtual environment such as a video game, a collaborative workspace, a social media platform, and/or the like. In these examples, the agent 110 may virtually track other actors 120, 122 in the virtual environment and the processor 102 may determine actions that the actors 120, 122 in the virtual environment may be predicted to perform as discussed herein. Additionally, the processor 102 may determine an action that the agent 110 is to perform based on the predicted actions that the actors 120, 122 in the virtual environment are predicted to perform.

As shown in FIG. 2, the memory 104 may have stored thereon machine-readable instructions 202-212 that the processor 102 may execute. Although the instructions 202-212 are described herein as being stored on a memory and may thus include a set of machine-readable instructions, the apparatus 100 may include hardware logic blocks that may perform functions similar to the instructions 202-212. For instance, the processor 102 may include hardware components that may execute the instructions 202-212. In other examples, the apparatus 100 may include a combination of instructions and hardware logic blocks to implement or execute functions corresponding to the instructions 202-212. In any of these examples, the processor 102 may implement the hardware logic blocks and/or execute the instructions 202-212. As discussed herein, the apparatus 100 may also include additional instructions and/or hardware logic blocks such that the processor 102 may execute operations in addition to or in place of those discussed above with respect to FIG. 2.

The processor 102 may execute the instructions 202 to access data 116 about an environment 130 of an agent 110. The data 116 may include information pertaining to the agent 110 and the actors 120, 122 in the environment 130. The information pertaining to the agent 110 may include current geographical information, directional information, movement information, etc., of the agent 110. The information pertaining to the actors 120, 122 may be images of the actors 120, 122, identification of the actors 120, 122, current positions of the actors 120, 122 with respect to the agent 110, current movement of the actors 120, 122, etc. For instance, the data 116 may indicate that a first actor 120 may currently be moving in a direction and speed denoted by the arrow 124 while a second actor 122 may currently be stationary.

The processor 102 may execute the instructions 204 to identify an actor 120 in the environment 130. Particularly, the processor 102 may identify an actor 120 that is currently moving and/or may move within a predefined period of time, e.g., within a window of time during which the agent 110 is within the environment in which the actor 120 is located. As such, for instance, the processor 102 may distinguish objects in the environment 130 that may be fixed, and thus may not likely move within the predefined period of time, from the actors 120, 122 that may be predicted to likely move within the predefined period of time. As discussed herein, the processor 102 may predict the actions of the actors 120, 122 and may use the predictions of the actions of the actors 120, 122 in determining an action that the agent 110 is to perform. Although particular reference is made to two actors 120, 122, it should be understood that features of the present disclosure may be expanded to any number of additional actors in the environment 130. Thus, for instance, the processor 102 may predict the actions of each of the actors in the environment and may use the predictions of the actions of the actors in determining an action that the agent 110 is to perform.

According to examples, the processor 102 may access images captured by the sensor 114 around the agent 110 and may determine the environment 130 of the agent 110 from the images. The sensor 114, which may be a camera or multiple cameras on the agent 110, may capture the images and the agent 110 may communicate the images, e.g., as data 116, to the processor 102. In addition, the processor 102 may analyze the accessed images to identify features in the environment 130 including a landscape of the environment 130, as well as objects and the actors 120, 122 in the environment 130. For instance, the processor 102 may employ an object recognition model to identify objects in the accessed images and to also identify the actor 120 as being one of the identified objects. The object recognition model may be programmed to distinguish between multiple types of objects and the processor 102 may employ the object recognition model to determine the types of objects that are included in the captured images.

The processor 102 may employ the object recognition model to determine one of the identified objects that is currently moving and/or may likely move as an actor 120. The processor 102 may also determine what the actor 120 is from execution of the object recognition model, e.g., whether the actor 120 is a car, a person on a bicycle, a person on a street, an animal, or the like.

The processor 102 may execute the instructions 206 to access candidate models 142, in which each of the candidate models 142 may predict a certain behavior of the identified actor 120. In some examples, the processor 102 may communicate a request to the server 140 for the candidate models 142, in which the request may include an identification of what the actor 120 is, e.g., that the actor 120 is a car, a person, or the like. In response, the server 140 may select a plurality of candidate models 142, from a number of models stored on the server 140 or in a storage accessible by the server 140, that may correspond to the identified type of the actor 120. Each of the candidate models 142 may predict a particular behavior of the identified actor 120. The server 140 may also communicate the selected candidate models 142 to the processor 102.

According to examples, the server 140 may access a lookup table that may include a listing of actor types and the models that correspond to the actor types to select the plurality of candidate models 142, although in other examples the candidate models 142 may be selected in other manners. In some examples, the server 140 may generate, develop, train, etc., the candidate models 142 of the actor 120 from, for instance, data collected from the actor 120 as well as data collected from other sources and may use the data to train the models. The server 140 may generate, e.g., train, the candidate models 142, which may be machine learning models, based on multiple inputs and outputs corresponding to the actor 120. For instance, the server 140 may collect data pertaining to the actor 120 type from the apparatus 100 and/or other sources. The data may include actions that the actor 120 performed based on various types of inputs and the types of environments 130 in which the actor 120 was positioned. In any regard, the server 140 may use any suitable machine learning processes to train the candidate models 142.

By way of particular example in which the actor 120 is an automobile, the server 140 may train a first candidate model 142 with a first input in which the automobile is in a right lane at a traffic light and the automobile made a right turn. The server 140 may also train a second candidate model 142 with a second input in which the automobile is in a center lane at a traffic light and the automobile went straight. The server 140 may further train additional candidate models 142 with other types of inputs and actions that the automobile performed. Thus, for instance, the candidate models 142 in these examples may be used to determine which direction an automobile may go based on the lane in which the automobile is located. The server 140 may train additional candidate models 142 for automobiles as well as for other types of actors using types of inputs suitable for the types of actors. By way of another example in which the actor 120 is a person, the server 140 may train the candidate models 142 for the person using various types of input, such as whether the person is at a crosswalk, in the middle of the street, next to a vehicle, etc., and the actions corresponding to those inputs.
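As a rough sketch of the kind of input-to-action training data described above, the snippet below builds a frequency table from logged (input, action) pairs and predicts the most commonly observed action for a given input. The dictionary-based "model" and the feature names (lane, etc.) are stand-ins for whatever machine learning process the server 140 might actually use.

```python
from collections import Counter, defaultdict

def train_candidate_model(observations):
    """Build a simple frequency-based candidate model from logged
    (input, action) pairs; a per-input action distribution stands in
    for a trained machine learning model."""
    table = defaultdict(Counter)
    for features, action in observations:
        key = tuple(sorted(features.items()))  # hashable input key
        table[key][action] += 1
    return table

def predict(model, features):
    """Return the most frequently observed action for the given input."""
    key = tuple(sorted(features.items()))
    actions = model.get(key)
    return actions.most_common(1)[0][0] if actions else None

# Example: the right-lane/turn-right pattern from the automobile example.
logs = [({"lane": "right"}, "turn_right"),
        ({"lane": "right"}, "turn_right"),
        ({"lane": "center"}, "go_straight")]
model = train_candidate_model(logs)
print(predict(model, {"lane": "right"}))  # -> "turn_right"
```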

The server 140 may generate models for a plurality of different types of actors 120 and may update the models as additional training data is collected. In some examples, the processor 102 may request, e.g., submit a query for, the model for the actor 120 from the server 140 after the processor 102 has identified the actor 120. In response, the server 140 may identify the model for the actor 120 and may communicate the model to the processor 102 via the network 150, which may be the Internet, a local area network, a cellular network, a combination thereof, and/or the like. According to examples, a computing device or server other than the server 140 may generate the models for the various actor types. In these examples, the server 140 may access the generated models from the other computing device or other server in selecting the candidate models 142 to send to the apparatus 100.

The processor 102 may access a variety of behavioral templates that may be available for the actor 120, in which each of the behavioral templates is a different model. Each of the models may predict different types of behavior, different goals or reward policies, different amounts of sophistication, and/or the like. The processor 102 may examine a history of behavior and may predict the behavior given each of the variety of behavioral templates, and may then apply the template (model) that best fits the behavior.
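One plausible reading of this template-fitting step is sketched below: each template is scored by how often it would have predicted the actions actually observed, and the best scorer is applied. The `predict(template, state)` interface is a hypothetical placeholder, not an interface defined by the disclosure.

```python
def best_fitting_template(templates, observed_history, predict):
    """Score each behavioral template against the observed action history
    and return the template whose predictions match it most often.
    `observed_history` is a list of (state, actual_action) pairs."""
    def fit(template):
        hits = sum(1 for state, actual in observed_history
                   if predict(template, state) == actual)
        return hits / max(len(observed_history), 1)
    return max(templates, key=fit)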

The processor 102 may also be equipped with a cloud-based learning model. In this case, the processor 102 (or a plurality of processors connected to the same cloud) may accumulate a large amount of behavioral data for many types of actors 120, 122 that the processor 102 may encounter. The cloud-based learning model may keep trying different models to refine the behavioral model for each class of actor 120, 122, and those models may then be offered during inferencing to identify which one fits the best. In other words, the cloud-based learning model may be construed as an evolutionary learning model.

The processor 102 may execute the instructions 208 to apply a selected candidate model of the accessed candidate models 142 on the accessed data to determine a predicted action of the identified actor 120. The processor 102 may select one of the accessed candidate models 142 to apply on the accessed data based on, for instance, which of the accessed candidate models 142 most closely matches the accessed data. That is, for instance, the processor 102 may compare elements of the accessed data, e.g., the location of the actor 120, the type of the actor 120, the direction in which the actor 120 is currently moving, whether a turn signal of the actor 120 is active, a current time of day, a current day of the week, whether another actor 122 is in the environment 130, and/or the like, with respective inputs of the accessed candidate models 142 to determine which of the candidate models 142 have inputs that match the greatest number of the elements of the accessed data. The processor 102 may select the candidate model 142 that has inputs that match the greatest number of the elements among the candidate models 142.

In addition, or in other examples, the elements of the accessed data may be weighted according to respective importance levels assigned to the elements. For instance, the direction in which the actor 120 is currently moving may be weighted higher than the time of day. In these examples, the processor 102 may determine which of the candidate models 142 have inputs that match the elements having the highest importance levels. That is, the processor 102 may assign numeric values to the elements according to their importance levels and may determine values of the candidate models 142 based on a summation (or other mathematical function) of the numeric values corresponding to the elements that match the inputs of the candidate models 142. In addition, the processor 102 may select the candidate model 142 having the highest value to determine the predicted action of the identified actor 120. The processor 102 may select the candidate model 142 using any other suitable method.
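A minimal sketch of this weighted matching follows; the element names and weight values are assumptions chosen for illustration, with the movement direction weighted above the time of day as in the example above.

```python
# Assumed importance weights for each data element (illustrative values only).
ELEMENT_WEIGHTS = {
    "location": 3.0,
    "actor_type": 4.0,
    "movement_direction": 5.0,
    "turn_signal": 4.0,
    "time_of_day": 1.0,
    "day_of_week": 1.0,
}

def score_model(candidate_inputs, accessed_data, weights=ELEMENT_WEIGHTS):
    """Sum the weights of the accessed-data elements that match the
    candidate model's expected inputs."""
    return sum(
        weights.get(element, 1.0)
        for element, value in accessed_data.items()
        if candidate_inputs.get(element) == value
    )

def select_candidate_model(candidates, accessed_data):
    """Pick the candidate model whose expected inputs best match the
    accessed data; `candidates` maps model ids to input dicts."""
    return max(candidates, key=lambda m: score_model(candidates[m], accessed_data))

# Usage: the turn model wins because it matches the higher-weighted elements.
candidates = {
    "model_turn": {"actor_type": "car", "movement_direction": "right"},
    "model_straight": {"actor_type": "car", "movement_direction": "forward"},
}
data = {"actor_type": "car", "movement_direction": "right", "time_of_day": "night"}
print(select_candidate_model(candidates, data))  # -> "model_turn"
```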

In any of the examples discussed herein, the processor 102 may determine that multiple candidate models 142 may be selected for use in determining the predicted action of the identified actor 120 when, for instance, the multiple candidate models 142 have inputs that are equal to each other with respect to matching the elements of the accessed data. In these instances, the processor 102 may select one of the multiple candidate models 142 based on any suitable criteria. For instance, the candidate models 142 may each be assigned a rating based on the respective accuracies of the candidate models 142, the respective popularities of the candidate models 142, the ages of the candidate models 142, the lengths of time since the candidate models 142 were created and/or updated, and/or the like. In addition, the processor 102 may select the candidate model 142 having the highest rating.

In other examples, the processor 102 may implement each of the candidate models 142 to determine respective predicted actions of the identified actor 120 corresponding to each of the candidate models 142. In addition, the processor 102 may determine that the identified actor 120 may be predicted to perform each of the predicted actions.

The selected candidate model for the identified actor 120 may be considered as a first order model in that the selected candidate model may pertain directly to the identified actor 120 without considering a model for a second actor 122 or the agent 110. According to examples, the selected candidate model may be a first order reinforcement learning model as shown in FIG. 3A, which depicts a diagram 300 of a first order reinforcement learning model for the actor 120 according to an embodiment of the present disclosure.

As shown in FIG. 3A, the first order reinforcement learning model for the actor 120 may take as inputs the environment 130 and a reward at time (t−1) and may return an action (a) at time (t). The action (a(t)) may be provided to the environment 130, which may produce the state (s) and the reward (r) at time (t+1). The processor 102 may determine that the actor 120 may interact with its environment 130 in discrete time steps. The processor 102 may also determine the predicted actions of the actor 120 from a set of available actions as defined by the selected candidate model, with the predicted action corresponding to, for instance, the reward (r) having the highest value. In other words, the processor 102 may determine the predicted action that optimizes, e.g., maximizes, the reward according to a reward policy. The reward policy may assign respective rewards to the predicted actions and the processor 102 may select the candidate model that results in the predicted action having the highest reward. In any regard, the processor 102 may determine a predicted action of the actor 120, e.g., as shown as dashed arrow 126 in FIG. 1.
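The first order selection just described might look like the following sketch, where `model.predicted_reward(state, action)` is a hypothetical interface standing in for the selected candidate model's reward policy rather than an interface defined by the disclosure.

```python
def predict_first_order_action(model, state, available_actions):
    """Return the action, from the model's set of available actions,
    that carries the highest predicted reward under the reward policy."""
    best_action, best_reward = None, float("-inf")
    for action in available_actions:
        reward = model.predicted_reward(state, action)  # assumed interface
        if reward > best_reward:
            best_action, best_reward = action, reward
    return best_action, best_reward
```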

The processor 102 may execute the instructions 210 to implement a recursive reinforcement learning model using the predicted action of the identified actor 120 to determine an action that the agent 110 is to perform. That is, the processor 102 may implement a second order reinforcement learning model to determine the action that the agent 110 is to perform. FIG. 3B shows an example diagram 310 of a recursive (e.g., second order) reinforcement learning model according to an embodiment of the present disclosure. In other examples, and as discussed herein, the recursive reinforcement learning model may be a third order reinforcement learning model or an even higher order reinforcement learning model. Recursive reinforcement learning may be defined as a technique in which reinforcement learning may be applied recursively on a reinforcement learning model. That is, each layer in the reinforcement learning model may be applied recursively to a higher order layer, so that the higher order models may use a reinforcement learning model to predict the behavior of other actors 120, 122 (modeled using reinforcement learning) in the environment 130, to improve the predictive power of the processor's 102 internal model so that the processor's 102 policy may determine predicted actions that may ultimately yield higher rewards.

As shown in FIG. 3B, the processor 102 may make a prediction about the expected reward for the actor 120 in the environment 130, e.g., may estimate the actor's 120 goals. In this regard, the processor 102 may not explicitly receive, from the actor 120 or a controller of the actor 120, information pertaining to the actor's goals or the expected reward. Once the processor 102 has assigned a prediction for the reward for the actor 120, the processor 102 may modify the agent reward to align with the predicted reward for the actor 120. For instance, the processor 102 may create empathy by adding the predicted reward for the actor 120 to the reward for the agent 110. Alternatively, the processor 102 may create sympathy, in which the processor 102 may determine an action for the agent 110 that may simply try to not reduce the reward for the actor 120. In other words, the recursive reinforcement learning model may be programmed to cooperate with other actors 120, 122, to anticipate their goals, and facilitate achieving those goals. In some examples, the processor 102 may determine the action that optimizes, e.g., maximizes, the reward for the agent 110 according to a reward policy.
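The empathy and sympathy modifications described above reduce to simple arithmetic on the predicted rewards. In the sketch below, the penalty constant and the baseline comparison are assumptions used only to make the distinction concrete; the disclosure does not prescribe particular values.

```python
def empathetic_reward(agent_reward, predicted_actor_reward):
    """'Empathy': add the actor's predicted reward to the agent's own,
    so actions that also help the actor score higher."""
    return agent_reward + predicted_actor_reward

def sympathetic_reward(agent_reward, predicted_actor_reward,
                       baseline_actor_reward, penalty=10.0):
    """'Sympathy': leave the agent's reward unchanged unless the action
    is predicted to reduce the actor's reward below its baseline, in
    which case apply a penalty (value assumed for illustration)."""
    if predicted_actor_reward < baseline_actor_reward:
        return agent_reward - penalty
    return agent_reward
```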

Although not shown, the processor 102 may make predictions about actions by each of the actors 120, 122 in the environment 130. In these examples, the processor 102 may apply separate recursive reinforcement learning models for each of the first order models applied to the actors 120, 122. As a result, for instance, the processor 102 may predict that the first actor 120 will move in the direction denoted by the arrow 126 and that the second actor 122 will move in the direction denoted by the arrow 128 at a time (t+1). In addition, the processor 102 may determine an action, e.g., movement as denoted by the arrow 132, that the agent 110 is to perform based on the predicted movements of both of the actors 120, 122. The processor 102 may also factor in other considerations, such as other objects in the environment 130 that may be in the path of the agent 110 or may otherwise affect the movement of the agent 110.

In some examples, the processor 102 may, in making predictions about actions by the actor 120, determine that the actor 120 may be unable to see or is unaware of the location and/or action of the second actor 122, e.g., the processor 102 may determine that the location and/or action of the second actor 122 is unavailable to the actor 120. In some instances, the action of the second actor 122 may not currently be visible to the actor 120 and the action may impact an action of the actor 120 and/or the agent 110. However, the processor 102 may determine that the second actor 122 is predicted to perform an action that may impact the actor 120. In addition, the processor 102 may cause the agent 110 to perform an action 132 that may inform the actor 120 of the action that the second actor 122 is predicted to perform. By way of particular example in which the agent 110 is an autonomous vehicle, the processor 102 may determine that the agent 110 may see a pedestrian (second actor 122) crossing a street, but in a location where another vehicle (actor 120) may not see the pedestrian, and the processor 102 may predict that the other vehicle does not see the pedestrian. In addition, the processor 102 may cause the agent 110 to perform an action such as, for instance, taking an appropriate evasive action, sounding an alarm, honking a horn, and/or the like.

According to examples, the processor 102 may implement a recursive reinforcement learning model that is a third order reinforcement learning model to determine the action that the agent 110 is to perform. A diagram 320 of a third order reinforcement learning model according to an embodiment of the present disclosure is depicted in FIG. 3C. As shown in FIG. 3C, the processor 102 may determine a first order predicted action 322 of a first actor 120, a first order predicted action 324 of a second actor 122, and a first order predicted action 326 of the agent 110, for instance, through implementation of respective first order reinforcement learning models on the first actor 120, the second actor 122, and the agent 110. The processor 102 may identify the first actor 120 and may determine the first order predicted action 322 of the first actor 120 in any of the manners discussed above. Similarly, the processor 102 may identify the second actor 122 in the environment 130 from the accessed data and may access second candidate models 142 of the second actor 122, e.g., from the server 140. The processor 102 may select a model for the second actor 122 in a manner similar to those discussed above with respect to the first actor 120.

In addition, the processor 102 may apply a second candidate model of the second candidate models 142 to determine the first order predicted action 324 of the second actor 122. That is, the processor 102 may apply reinforcement learning on the selected model in a manner similar to any of the manners discussed above with respect to the actor 120 to determine the first order predicted action 324 of the second actor 122. Thus, for instance, the processor 102 may determine the first order predicted action 324 of the second actor 122 through application of the second candidate model and an analysis of the accessed data about the environment 130. That is, the processor 102 may use the accessed data about the environment 130 as inputs into the second candidate model to determine a plurality of predicted actions of the second actor 122 resulting from the inputs. The processor 102 may further select the first order predicted action 324 that optimizes, e.g., maximizes, the reward for the second actor 122. The processor 102 may also determine the first order predicted action 326 of the agent 110 in similar manners while also factoring the first order predicted action 322 of the first actor 120 and the first order predicted action 324 of the second actor 122 as shown in FIG. 3C.

In these examples, and as shown in FIG. 3C, the processor 102 may apply a recursive reinforcement learning model, for instance, by using the first order predicted action 326 of the agent 110 and the first order predicted action 324 of the second actor 122 as inputs in determining a second order predicted action 126 of the first actor 120. That is, the processor 102 may use the first order predicted action 326 of the agent 110 and the first order predicted action 324 of the second actor 122 as factors in determining the second order predicted action 126 of the first actor 120. For instance, once the processor 102 has assigned predictions for second order rewards for the second actor 122 and the agent 110, the processor 102 may modify the second order reward of the first actor 120 to align with the second order predicted rewards for the second actor 122 and the agent 110. By way of example, the processor 102 may create empathy by adding the predicted rewards for the second actor 122 and the agent 110 to the reward for the first actor 120. Alternatively, the processor 102 may create sympathy in which the processor 102 may determine a second order predicted action 126 for the first actor 120 that may simply try to not reduce the predicted rewards for the second actor 122 and/or the agent 110. In other words, the recursive reinforcement learning model for the first actor 120 may be programmed to cooperate with the second actor 122 and the agent 110 to anticipate the goals of the second actor 122 and the agent 110, and facilitate achieving those goals.

As also shown in FIG. 3C, the processor 102 may apply a recursive reinforcement learning model, for instance, by using the first order predicted action 326 of the agent 110 and the first order predicted action 322 of the first actor 120 as inputs in determining a second order predicted action 128 of the second actor 122. That is, the processor 102 may use the first order predicted action 326 of the agent 110 and the first order predicted action 322 of the first actor 120 as factors in determining the second order predicted action 128 of the second actor 122. For instance, once the processor 102 has assigned predictions for second order rewards for the first actor 120 and the agent 110, the processor 102 may modify the second order reward of the second actor 122 to align with the second order predicted rewards for the first actor 120 and the agent 110. By way of example, the processor 102 may create empathy by adding the predicted rewards for the first actor 120 and the agent 110 to the reward for the second actor 122. Alternatively, the processor 102 may create sympathy in which the processor 102 may determine a second order predicted action 128 for the second actor 122 that may try to not reduce the predicted rewards for the first actor 120 and/or the agent 110. In other words, the recursive reinforcement learning model for the second actor 122 may be programmed to cooperate with the first actor 120 and the agent 110 to anticipate the goals of the first actor 120 and the agent 110, and facilitate achieving those goals.

As discussed above, the processor 102 may determine the second order predicted action 126 of the first actor 120 using a second order reinforcement learning model in which the first order predicted action 324 of the second actor 122 and the first order predicted action 326 of the agent 110 may be used in the prediction for the reward for the second order predicted action 126 of the first actor 120. Thus, for instance, the processor 102 may determine the second order predicted action 126 of the first actor 120 through application of the selected candidate model and an analysis of the accessed data about the environment 130. That is, the processor 102 may use the accessed data about the environment 130 as inputs into the selected candidate model to determine a plurality of predicted actions of the first actor 120 resulting from the inputs. The processor 102 may further select the second order predicted action 126 that optimizes the reward for the first actor 120.

The processor 102 may also determine the second order predicted action 128 of the second actor 122 using a second order reinforcement learning model in which the first order predicted action 322 of the first actor 120 and the first order predicted action 326 of the agent 110 may be used in the prediction for the reward for the second order predicted action 128 of the second actor 122. Thus, for instance, the processor 102 may determine the second order predicted action 128 of the second actor 122 through application of the selected candidate model and an analysis of the accessed data about the environment 130. That is, the processor 102 may use the accessed data about the environment 130 as inputs into the selected candidate model to determine a plurality of predicted actions of the second actor 122 resulting from the inputs. The processor 102 may further select the second order predicted action 128 that optimizes the reward for the second actor 122.

In addition, as shown in FIG. 3C, the processor 102 may determine a third order predicted action 132 that the agent 110 is to perform using a third order reinforcement learning model in which the second order predicted action 126 of the first actor 120 and the second order predicted action 128 of the second actor 122 may be used in the prediction for the reward for the third order predicted action 132 of the agent 110. The processor 102 may determine the third order predicted action 132 in any of the manners discussed herein with respect to FIG. 3B. Thus, for instance, the processor 102 may determine the third order predicted action 132 using the second order predicted action 126 of the first actor 120 and the second order predicted action 128 of the second actor 122, in which each of the second order predicted actions 126, 128 may have been determined using the first order predicted actions 322, 324, 326 of the first actor 120, the second actor 122, and the agent 110.
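The nesting in FIG. 3C can be summarized in the following sketch, in which each `predict` call is a hypothetical interface that takes the environment state plus the predicted actions of the other parties and returns an (action, reward) pair; the model objects stand in for the candidate models discussed above and are assumptions of this example.

```python
def third_order_action(agent_model, actor1_model, actor2_model, state):
    """Sketch of the recursion in FIG. 3C using an assumed
    model.predict(state, others) -> (action, reward) interface."""
    # First order: each party modeled in isolation, with the agent's
    # first order action factoring in both actors (actions 322, 324, 326).
    a1_1, _ = actor1_model.predict(state, others=())
    a2_1, _ = actor2_model.predict(state, others=())
    ag_1, _ = agent_model.predict(state, others=(a1_1, a2_1))

    # Second order: each actor re-modeled with the other parties' first
    # order predictions folded into its reward (actions 126 and 128).
    a1_2, _ = actor1_model.predict(state, others=(a2_1, ag_1))
    a2_2, _ = actor2_model.predict(state, others=(a1_1, ag_1))

    # Third order: the agent's action 132, chosen against the second
    # order predictions for both actors.
    action_132, _ = agent_model.predict(state, others=(a1_2, a2_2))
    return action_132
```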

According to examples, the processor 102 may determine the third order predicted action 132 of the agent 110 through application of the third order reinforcement learning model and an analysis of the accessed data about the environment 130. That is, the processor 102 may use the accessed data about the environment 130 as inputs into the third order reinforcement learning model to determine a plurality of predicted actions of the agent 110 resulting from the inputs. The processor 102 may further select the third order predicted action 132 that optimizes the reward for the agent 110. In some examples, the processor 102 may use fuzzing and/or simulations, e.g., Monte Carlo simulations, to analyze the outcomes of multiple candidate actions of the agent 110 and may predict the outcomes over a short time in the future. In these examples, the processor 102 may select the action 132 that is predicted to result in an optimal result, e.g., that results in the highest reward for the agent 110.
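As one way to realize the simulation-based selection described above, the sketch below averages several randomized rollouts per candidate action and keeps the best-scoring action. `simulate` is a hypothetical environment model supplied by the caller, and the rollout count and horizon are arbitrary illustration values.

```python
def choose_action_by_rollout(candidate_actions, simulate,
                             n_rollouts=100, horizon=10):
    """Monte Carlo-style selection: estimate each candidate action's value
    by the mean cumulative reward over short simulated rollouts, then
    return the action with the highest estimate.  `simulate(action, horizon)`
    is assumed to run one randomized rollout and return its total reward."""
    best_action, best_mean = None, float("-inf")
    for action in candidate_actions:
        total = sum(simulate(action, horizon) for _ in range(n_rollouts))
        mean_reward = total / n_rollouts
        if mean_reward > best_mean:
            best_action, best_mean = action, mean_reward
    return best_action
```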

According to examples, the processor 102 may upload the determined action 132 that the agent 110 is to perform to a server, e.g., the server 140, via the network 150. In these examples, the server 140 may update the third order model that the processor 102 used to determine the action 132 to incorporate the action 132. That is, for instance, the third order model may be a machine learning model and the server 140 may train the third order model with the action 132 as an output of the third order model. The processor 102 may also upload the candidate models that the processor 102 used to determine the predicted actions 126, 128 of the actors 120, 122 to the server 140, and the server 140 may update the candidate models 142 of the actors 120, 122 based on the predicted actions 126, 128. That is, the server 140 may use the predicted actions 126, 128 as respective outputs of the candidate models 142 in training the candidate models 142.

The processor 102 may execute the instructions 212 to cause, e.g., instruct, the agent 110 to perform the determined action 132. In examples in which the processor 102 is integrated with the agent 110, the processor 102 may control the agent 110 to perform the determined action 132. In examples in which the processor 102 is separate from the agent 110, the processor 102 may communicate an instruction to the agent 110 to perform the determined action 132.

The processor 102 may further continue to track the actions of the actors 120, 122 and may upload information pertaining to the tracked actions to the server 140. The server 140 may use the uploaded information to update the models 142 of the actors 120, 122. The server 140 may also receive other information pertaining to the tracked actions of the actors 120, 122 from the actors 120, 122 as well as from other actors and may use the other information to update the models. In any of these examples, the server 140 may continuously update the models 142, may update the models 142 at various intervals of time, or the like. As a result, the models 142 may continuously be updated to more accurately predict actions that the actors 120, 122 may perform.

Various manners in which the processor 102 of the apparatus 100 may operate are discussed in greater detail with respect to the methods 400 and 500 respectively depicted in FIGS. 4 and 5. Particularly, FIGS. 4 and 5 respectively depict flow diagrams of methods 400 and 500 for applying a third order reinforcement learning model to determine an action that an agent 110 is to perform, in accordance with embodiments of the present disclosure. It should be understood that the methods 400 and 500 respectively depicted in FIGS. 4 and 5 may include additional operations and that some of the operations described therein may be removed and/or modified without departing from the scopes of the methods 400 and 500. The descriptions of the methods 400 and 500 are made with reference to the features depicted in FIGS. 1-3C for purposes of illustration.

With reference first to FIG. 4, at block 402, the processor 102 may access data about an environment 130 of an agent 110. As discussed herein, the agent 110 may include a sensor 114, e.g., a camera, that may capture images and/or video of the environment 130 in which the agent 110 is located. In these examples, the processor 102 may access captured images of the environment 130.

At block 404, the processor 102 may identify a first actor 120 and a second actor 122 in the environment 130. For instance, the processor 102 may identify the first actor 120 and the second actor 122 through application of an object recognition program on the captured images and/or video as discussed herein. Although particular reference is made to a first actor 120 and a second actor 122 being identified from the accessed data, it should be understood that additional actors may also be identified.

At block 406, the processor 102 may apply a first order reinforcement learning model on the second actor 122 to determine a first order predicted action 324 of the second actor 122. The processor 102 may determine the first order predicted action 324 through implementation of a selected candidate model as discussed herein. For instance, the processor 102 may determine the first order predicted action 324 of the second actor 122 by determining the first order predicted action 324 to be an action that optimizes a reward determined through application of the first order reinforcement learning model and an analysis of the accessed data about the environment 130. In addition, the processor 102 may determine the first order predicted action 324 of the second actor 122 by determining the first order predicted action 324 through analysis of predicted actions of the first actor 120 and the agent 110 from the perspective of the second actor 122. That is, the processor 102 may determine predicted actions of the first actor 120 and the agent 110 from the perspective of the second actor 122 and may determine the first order predicted action 324 of the second actor 122 based on the predicted actions of the first actor 120 and the agent 110 as discussed above with respect to FIG. 3C. In other words, the processor 102 may determine the actions 322, 326 as may be predicted by the second actor 122.

At block 408, the processor 102 may apply a second order reinforcement learning model on the first actor 120 to determine a second order predicted action 126 of the first actor 120. As discussed herein, the second order reinforcement learning model may use a predicted reward of the first order predicted action 324 of the second actor 122 to determine the second order predicted action 126 of the first actor 120. The processor 102 may determine the second order predicted action 126 through implementation of a selected candidate model as discussed herein. For instance, the processor 102 may determine the second order predicted action 126 of the first actor 120 by determining the second order predicted action 126 to be an action that optimizes a reward determined through application of the second order reinforcement learning model and an analysis of the accessed data about the environment 130. In addition, the processor 102 may determine the second order predicted action 126 of the first actor 120 by determining the second order predicted action 126 through analysis of the first order predicted actions 324, 326 of the second actor 122 and the agent 110. That is, the processor 102 may predict actions of the second actor 122 and the agent 110 and may determine the second order predicted action 126 of the first actor 120 based on the first order predicted actions 324, 326 of the second actor 122 and the agent 110.

At block 410, the processor 102 may apply a third order reinforcement learning model on the agent 110 to determine an action 132 that the agent 110 is to perform. As discussed herein, the third order reinforcement learning model may use a predicted reward of the second order predicted action 126 to determine the action 132 that the agent 110 is to perform. The processor 102 may determine the action 132 through implementation of a model as discussed herein. For instance, the processor 102 may determine the action 132 of the agent 110 by determining the action 132 to be an action that optimizes a reward determined through application of the third order reinforcement learning model and an analysis of the accessed data about the environment 130. In addition, the processor 102 may determine the action 132 that the agent 110 is to perform by determining the action 132 through analysis of the second order predicted actions 126, 128 of the first actor 120 and the second actor 122. That is, the processor 102 may determine the second order predicted actions 126, 128 of the first actor 120 and the second actor 122 and may determine the action 132 that the agent 110 is to perform based on the determined second order predicted actions 126, 128 as discussed herein with respect to FIG. 3C.

According to examples, the processor 102 may simulate multiple candidate actions of the agent 110, e.g., via implementation of Monte Carlo simulations, and may select an action that is predicted to result in an optimal result as the action 132 that the agent 110 is to perform. The action that is predicted to result in the optimal result may be, for instance, the action that results in a highest reward among the candidate actions of the agent 110.

At block 412, the processor 102 may cause, e.g., instruct, the agent 110 to perform the determined action 132. As discussed herein, the processor 102 may control the agent 110 to perform the determined action 132 and/or communicate an instruction to the agent 110 to perform the determined action 132.

Turning now to FIG. 5, at block 502, the processor 102 may access data about an environment 130 of an agent 110 in a manner similar to any of those discussed above with respect to block 402. At block 504, the processor 102 may identify a first actor 120 and a second actor 122 in the environment 130 in a manner similar to any of those discussed above with respect to block 404.

At block 506, the processor 102 may apply a first order reinforcement learning model on the second actor 122 to determine a first order predicted action 324 of the second actor 122. In addition, at block 508, the processor 102 may apply a second order reinforcement learning model on the first actor 120 to determine a second order predicted action 126 of the first actor 120. The operations in blocks 506 and 508 may be similar to blocks 406 and 408 discussed herein with respect to FIG. 4.

At block 510, the processor 102 may apply a first order reinforcement learning model on the first actor 120 to determine a first order predicted action 322 of the first actor 120. The processor 102 may determine the first order predicted action 322 through implementation of a selected candidate model as discussed herein. For instance, the processor 102 may determine the first order predicted action 322 of the first actor 120 by determining the first order predicted action 322 to be an action that optimizes a reward determined through application of the first order reinforcement learning model and an analysis of the accessed data about the environment 130. In addition, the processor 102 may determine the first order predicted action 322 of the first actor 120 by determining the first order predicted action 322 through analysis of predicted actions of the second actor 122 and the agent 110 from the perspective of the first actor 120. That is, the processor 102 may determine predicted actions of the second actor 122 and the agent 110 from the perspective of the first actor 120 and may determine the first order predicted action 322 of the first actor 120 based on the predicted actions of the second actor 122 and the agent 110 as discussed above with respect to FIG. 3C. In other words, the processor 102 may determine the actions 324, 326 as may be predicted by the first actor 120.

At block 512, the processor 102 may apply a second order reinforcement learning model on the second actor 122 to determine a second order predicted action 128 of the second actor 122. As discussed herein, the second order reinforcement learning model may use a predicted reward of the first order predicted action 322 of the first actor 120 to determine the second order predicted action 128 of the second actor 122. The processor 102 may determine the second order predicted action 128 through implementation of a selected candidate model as discussed herein. For instance, the processor 102 may determine the second order predicted action 128 of the second actor 122 by determining the second order predicted action 128 to be an action that optimizes a reward determined through application of the second order reinforcement learning model and an analysis of the accessed data about the environment 130. In addition, the processor 102 may determine the second order predicted action 128 of the second actor 122 by determining the second order predicted action 128 through analysis of first order predicted actions 322, 326 of the first actor 120 and the agent 110. That is, the processor 102 may predict actions of the first actor 120 and the agent 110 and may determine the second order predicted action 128 of the second actor 122 based on the first order predicted actions 322, 326 of the first actor 120 and the agent 110.

At block 514, the processor 102 may apply a third order reinforcement learning model on the agent 110 to determine an action 132 that the agent 110 is to perform. As discussed herein, the third order reinforcement learning model may use a predicted reward of the second order predicted actions 126, 128 of the first actor 120 and the second actor 122 to determine the action 132 that the agent 110 is to perform. The processor 102 may determine the action 132 through implementation of a model as discussed herein. For instance, the processor 102 may determine the action 132 of the agent 110 by determining the action 132 to be an action that optimizes a reward determined through application of the third order reinforcement learning model and an analysis of the accessed data about the environment 130.
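
By way of further illustration, the nesting of blocks 506 through 514 may be sketched in Python as a chain of reward-maximizing lookups. The action set, the toy reward tables, and the interaction penalties below are hypothetical placeholders for the trained candidate models; only the first/second/third order structure follows the method described above.

    ACTIONS = ["stop", "slow", "go"]

    def predict(reward, predicted_others):
        # Return the action maximizing a reward that also scores the
        # predicted actions of the other parties (empty for first order).
        return max(ACTIONS, key=lambda a: reward(a, predicted_others))

    # Toy rewards; a real system would use the trained candidate models.
    def r_first_order(a, others):
        return {"stop": 0, "slow": 1, "go": 2}[a]

    def r_second_order(a, others):
        # Penalize a conflicting "go" given the first order predictions.
        return r_first_order(a, others) - (a == "go" and "go" in others)

    def r_agent(a, others):
        return r_first_order(a, others) - 2 * (a == "go" and "go" in others)

    p324 = predict(r_first_order, [])            # block 506: second actor
    p126 = predict(r_second_order, [p324])       # block 508: first actor
    p322 = predict(r_first_order, [])            # block 510: first actor
    p128 = predict(r_second_order, [p322])       # block 512: second actor
    action_132 = predict(r_agent, [p126, p128])  # block 514: the agent
    print(action_132)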

At block 516, the processor 102 may instruct the agent 110 to perform the determined action 132. In addition, or alternatively, the processor 102 may upload the determined action 132 that the agent 110 is to perform to a server, e.g., the server 140, via the network 150.

Some or all of the operations set forth in the methods 400 and 500 may be included as utilities, programs, or subprograms, in any desired computer accessible medium. In addition, the methods 400 and 500 may be embodied by computer programs, which may exist in a variety of forms both active and inactive. For example, they may exist as machine readable instructions, including source code, object code, executable code or other formats. Any of the above may be embodied on a non-transitory computer-readable storage medium.

Examples of non-transitory computer-readable storage media include computer system RAM, ROM, EPROM, EEPROM, and magnetic or optical disks or tapes. It is therefore to be understood that any electronic device capable of executing the above-described functions may perform the functions enumerated above.

Turning now to FIG. 6, there is shown a block diagram of a computer-readable medium 600 that may have stored thereon computer-readable instructions for determining an action 132 that an agent 110 is to perform using predicted actions of a first actor 120 and a second actor 122, in accordance with an embodiment of the present disclosure. It should be understood that the computer-readable medium 600 depicted in FIG. 6 may include additional instructions and that some of the instructions described herein may be removed and/or modified without departing from the scope of the computer-readable medium 600 disclosed herein. The computer-readable medium 600 may be a non-transitory computer-readable medium, in which the term “non-transitory” does not encompass transitory propagating signals.

The computer-readable medium 600 may have stored thereon machine-readable instructions 602-612 that a processor, such as the processor 102 depicted in FIGS. 1 and 2, may execute. The computer-readable medium 600 may be an electronic, magnetic, optical, or other physical storage device that contains or stores executable instructions. The computer-readable medium 600 may be, for example, Random Access Memory (RAM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a storage device, an optical disc, and the like.

The processor may fetch, decode, and execute the instructions 602 to access data 116 about an environment 130 of an agent 110. The processor may access the data 116 in any of the manners discussed herein.

The processor may fetch, decode, and execute the instructions 604 to identify a first actor 120 and a second actor 122 within the environment 130. The processor may fetch, decode, and execute the instructions 606 to determine a second order predicted action 126 of the first actor 120, in which the first order predicted action 324 of the second actor 122 and a first order predicted action 326 of the agent 110 may be used as factors in determining the second order predicted action 126 of the first actor 120. The processor may fetch, decode, and execute the instructions 608 to determine a second order predicted action 128 of the second actor 122, in which the first order predicted action 322 of the first actor 120 and a first order predicted action 326 of the agent 110 may be used as factors in determining the second order predicted action 128 of the second actor 122. The processor may fetch, decode, and execute the instructions 610 to determine an action 132 that the agent 110 is to perform using the second order predicted action 126 of the first actor 120 and the second order predicted action 128 of the second actor 122 as factors. The processor may fetch, decode, and execute the instructions 612 to cause the agent 110 to perform the determined action 132.

As discussed herein, the processor may implement a recursive reinforcement learning model to determine the action 132 that the agent 110 is to perform. As also discussed herein, the processor may simulate outcomes of multiple candidate actions that the agent 110 is to perform and may select an action that is predicted to result in an optimal reward for the agent 110 as the action that the agent 110 is to perform.

In some examples, the processor may determine the second order predicted action 126 using both the first order predicted action 324 of the second actor 122 and a first order predicted action 326 of the agent 110 from a perspective of the first actor 120. That is, the processor may predict how the first actor 120 may predict the action of the second actor 122 and the action of the agent 110. For instance, the processor may apply a first order model of the second actor 122 to predict the first order predicted action 324 of the second actor 122 and another first order model of the agent 110 to predict the first order predicted action 326 of the agent 110, in which the first order models are from the perspective of the first actor 120. In addition, the processor may input the predicted actions of the second actor 122 and the agent 110 into a second order model of the first actor 120. The processor may also determine the second order predicted action 126 through implementation of the second order model of the first actor 120.

In addition or in other examples, the processor may determine the second order predicted action 128 using both the first order predicted action 322 of the first actor 120 and a first order predicted action 326 of the agent 110 from a perspective of the second actor 122. That is, the processor may predict how the second actor 122 may predict the first order predicted action 322 of the first actor 120 and the first order predicted action 326 of the agent 110. For instance, the processor may apply a first order model of the first actor 120 to predict the first order predicted action 322 of the first actor 120 and another first order model of the agent 110 to predict the first order predicted action 326 of the agent 110, in which the first order models are from the perspective of the second actor 122. In addition, the processor may input the first order predicted actions 322, 326 of the first actor 120 and the agent 110 into a second order model of the second actor 122. The processor may also determine the second order predicted action 128 through implementation of the second order model of the second actor 122.

Reference is now made to FIG. 7, which depicts a diagram 700 of a reinforcement learning model according to an embodiment of the present disclosure. It should be understood that the diagram 700 depicted in FIG. 7 may include additional elements and that some of the elements described herein may be removed and/or modified without departing from the scope of the diagram 700 disclosed herein.

As shown in FIG. 7, the processor 102 may implement a multi-part reinforcement learning model for the agent 110. The multi-part reinforcement learning model may include a perceptive model 710 that may incorporate inputs from the agent's 110 environment 130 as well as a reward feedback and predicted control actions from a control model 704. As discussed herein, the sensory input may be snapshots, which may be referred to as frames, e.g., analogous to a video input frame together with the concurrent input from other input sensors at that time. The processor 102 may record input sequentially as frames, and the perceptive model 710 may be an autoencoder, which may concurrently be trained with a predictive model 706.
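
A minimal sketch of such a perceptive autoencoder is given below, assuming PyTorch and hypothetical dimensions; the concatenation of sensor frame, reward feedback, and predicted control output into a single encoded state vector is the point being illustrated, not the particular architecture.

    import torch
    import torch.nn as nn

    FRAME, CONTROL, STATE = 1024, 8, 128   # hypothetical dimensions

    class PerceptiveModel(nn.Module):
        # Autoencoder over [sensor frame | reward feedback | predicted control].
        def __init__(self):
            super().__init__()
            d_in = FRAME + 1 + CONTROL
            self.enc = nn.Sequential(nn.Linear(d_in, 256), nn.ReLU(),
                                     nn.Linear(256, STATE))
            self.dec = nn.Sequential(nn.Linear(STATE, 256), nn.ReLU(),
                                     nn.Linear(256, d_in))

        def forward(self, frame, reward, control):
            x = torch.cat([frame, reward, control], dim=-1)
            v = self.enc(x)           # agent state vector v(t)
            return v, self.dec(v)     # reconstruction drives the training loss

    model = PerceptiveModel()
    v, recon = model(torch.randn(1, FRAME), torch.randn(1, 1),
                     torch.randn(1, CONTROL))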

The predictive model 706 may attempt to predict the next frame, using a long short-term memory (LSTM) or similar recurrent model. In some examples, the predictive model 706 may be run iteratively on its own output. This may be construed as an imagination mode, in the sense that the predictive model 706 may simply be imagining what might occur next.
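
The imagination mode may be sketched as follows, again assuming PyTorch; feeding the recurrent model its own prediction is the only aspect taken from the description above, and the dimensions are hypothetical.

    import torch
    import torch.nn as nn

    STATE = 128
    lstm = nn.LSTM(input_size=STATE, hidden_size=STATE, batch_first=True)

    def imagine(v0, steps=5):
        # Feed the predictive model its own output to roll out future frames.
        v, hidden, rollout = v0, None, []
        for _ in range(steps):
            v, hidden = lstm(v, hidden)   # predicted next state vector
            rollout.append(v)
        return rollout

    imagined = imagine(torch.randn(1, 1, STATE))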

The third model may be construed as the control model 704, which may be a simple (and much smaller) reinforcement learning model that may try to maximize the reward based on a reward policy. The output of the control model 704 may be the action 132 that the agent 110 is to perform, which may correspond to direct outputs to the agent's 110 motors and/or other control mechanisms. Conveniently, the output of the predictive model 706 may include an expected reward, which may allow the predictive model 706 to be run on a number of “what if” scenarios to try to predict the expected reward from various actions.

The fourth model may be construed as the perspective model 702, which may also be based on the predictive model 706. The perspective model 702 may predict the sensory input from another perspective in the environment 130. This may allow the perspective model to attempt to predict what other actors 120, 122 might see, or otherwise detect from their respective sensors.

The fifth model may be construed as the agent model 708, which may be responsible for modeling and predicting the behavior of the actors 120, 122 in the environment 130. The agent model 708 may be designed to model and represent how the processor 102 is making its decisions for the agent 110. The agent model 708 may take many different forms, but in its most general form, the agent model 708 may contain a duplicate of all the major elements of a full model. The agent model 708 may be a second order model and may include a perceptive model, a predictive model, a control model, a perspective model, and even agent models representing the other actors 120, 122. However, to avoid an infinite recursion, these second order agent models, while they may also include the same decision making processes of the primary model, may not include third order agent models again representing the other agents. The second order agent models may include only perceptive, predictive, and control models.
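
The nesting described above, with the recursion capped below the innermost layer, may be sketched as follows; the class and field names are hypothetical labels for the sub-models named herein.

    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class WorldModel:
        # Placeholder labels for the sub-models discussed herein.
        perceptive: str = "autoencoder"
        predictive: str = "lstm"
        control: str = "rl-policy"
        perspective: Optional[str] = None
        agent_models: list = field(default_factory=list)

    def build_model(order):
        # Order 3: the agent's own model. Order 2: agent models of the other
        # actors. Order 1: innermost models, with no further agent models,
        # which caps the recursion.
        m = WorldModel()
        if order >= 2:
            m.perspective = "viewpoint-predictor"
            m.agent_models = [build_model(order - 1) for _ in range(2)]
        return m

    agent_model_708 = build_model(3)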

The perceptive model 710 may produce an autoencoder vector (agent state vector v(t)) by synthesizing the sensory input, a prediction of a reward feedback, and also a prediction of the control model 704 output. In other words, the perceptive model 710 may encode the images collected by the sensor 114 of the agent 110, what the current control state is in terms of where the agent 110 is going, and an expected reward for the current conditions. So, for instance, if the agent 110 were at the edge of a cliff, and the control output has all of the wheels moving full speed ahead, the “next frame” would be the agent 110 falling off the cliff, and a large negative reward, assuming that the agent 110 were programmed with a reward policy not to reward the agent 110 for driving off a cliff.

In some examples, a perceptive vector (agent state vector v(t)) may be sufficiently large to incorporate not only the visual and other sensory inputs, but also a cognitive interpretation of things going on in the environment 130, in particular recognition of specific actors 120, 122 in the environment 130. Also, the perceptive vector may include encoding of some objects in the images that the sensor 114 may have collected in previous frames, and an overall perception of the environment 130. By way of example in which the agent 110 is a robot that is learning to navigate a maze, the perceptive vector may encode in some way the entire maze, such that the processor may make some intelligent predictions about how to navigate the maze. In other words, the whole perceptive vector may encode a complete representation of the robot's knowledge of the world, including things that the robot may not currently see, but may have seen in previous frames.

This feature of the perceptive model 710 may allow additional extrapolation from the perceptive vector to produce sensory input from a different perspective, for example 3 feet in front of the robot but facing the other way. To optimize training of the perspective model 702, the processor 102 may sense spatial data, which may optimally be recorded using a stereo camera input configuration, e.g., of the sensor 114. Training may be optimized by having multiple cameras at different angles, including side and rear view cameras, as well as stereo vision to allow the robot to see and encode distance to objects.

The predictive model 706 may concurrently be trained with the perceptive model 710 using, for instance, an LSTM, to try to predict the next frame of sensory input from the perceptive model 710. The predictive model 706 may be trained on a sequence of output vectors of the perceptive model 710 and may learn to predict the behavior of a complex environment, so the output vector of the predictive model 706 may also have high dimension, to encode the vast complexity of this environment. The predictive model 706 may learn the rules of the environment to predict the outcome of certain actions as well as to anticipate what will happen next independently of the agent's 110 own action.
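
A hypothetical training step for such a predictive model, in PyTorch, might look as follows; the batch of perceptive-model output sequences is simulated with random data.

    import torch
    import torch.nn as nn

    STATE = 128
    predictive = nn.LSTM(input_size=STATE, hidden_size=STATE, batch_first=True)
    optimizer = torch.optim.Adam(predictive.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    # Stand-in batch of perceptive vector sequences, shape (batch, time, STATE).
    seq = torch.randn(16, 32, STATE)
    inputs, targets = seq[:, :-1], seq[:, 1:]   # predict v(t+1) from v(t)

    predicted, _ = predictive(inputs)
    loss = loss_fn(predicted, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()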

The predictive model 706 may not be given the role of predicting what the actors 120, 122 may do. Instead, that may be the role of the agent model 708. By way of example, the agent model 708 may encode the model template for each actor 120, 122 in the agent's 110 environment 130. That template may be used to run an independent model for each actor 120, 122 in the agent's 110 environment 130. The template may encode a variety of different behavioral models, which may be assumed to be similar in structure to the multiple models used to control the agent 110. As such, a model may be generated and assigned to each actor 120, 122, and the outputs of those models, for instance, in the form of predictions of what those actors 120, 122 will do, may be used as an input to the predictive model 706 for the agent 110. In addition, the complex behavior of the actors 120, 122 in the agent's 110 environment may collectively be taken as an input to the predictive model 706 of the agent 110. As a result, the predictive model 706 may predict the output of the next frame of the perceptive model 710, given the expected actions as predicted by the agent model 708 for each actor 120, 122 in the agent's 110 environment.

For example, if the agent 110 is facing an actor 120, there may be an agent model 708 assigned to that actor 120. The agent model 708 may predict that the actor 120 is going to turn right, and that prediction may be an input into the predictive model 706 for the agent 110. As such, the predictive model 706 may simply attempt to predict what the agent 110 is going to see in the next frame, given that the agent model 708 for the actor 120 predicts it is going to turn to the right. In other words, the predictive model 706 may predict what happens next given the anticipated behavior of the actor 120.

The control model 704 may take as an input the current output of the perceptive model 710, together with the output from the predictive model 706 not only for the next frame, but also for a variety of expected future frames. The control model 704 may use a Monte Carlo simulation, as well as the perceptive model's 710 output at several points in the past, so that the control model 704 has a rich set of data for the current predictions. The control model 704 may then be trained using a reinforcement learning algorithm to try to maximize the reward given a reward policy. Due to the time sequence nature of the input to the control model 704, the control model 704 may be trained using a recurrent model analogous to an RNN or LSTM.
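
Assembling that input window may be sketched as follows; the window sizes and the predict_next stand-in are hypothetical simplifications of the predictive model rollout.

    from collections import deque
    import torch

    STATE, PAST, FUTURE = 128, 4, 3
    history = deque(maxlen=PAST)   # recent perceptive-model outputs

    def control_input(v_now, predict_next):
        # Stack past perceptive vectors, the current one, and imagined
        # future frames into one input window for the control model.
        future, v = [], v_now
        for _ in range(FUTURE):
            v = predict_next(v)    # predictive model rollout (stand-in)
            future.append(v)
        past = list(history) or [torch.zeros_like(v_now)] * PAST
        return torch.cat(past + [v_now] + future, dim=-1)

    window = control_input(torch.randn(1, STATE),
                           lambda v: v + 0.01 * torch.randn_like(v))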

The perceptive model 710 may include encoded output from the agent model 708, so the input to the control model 704 may include a prediction of the behavior of one or more actors 120, 122 currently in the agent's 110 environment 130, even potentially actors that are not visible in the current frame, but may be visible in future or previous frames. As a result, the control model 704 may make decisions based on predictions of the behavior of other actors 120, 122, using behavioral models customized to each of the actors 120, 122.

The control model 704 may be optimized to run through a variety of possible, e.g., “what if,” scenarios, to, for instance, enable the control model 704 to essentially learn “on the fly.” The predictive model 706 may be continuously learning and being updated with new data, while at the same time being run forward a few seconds to try to predict what is coming next. The processing for the predictive model 706 may be done using cloud resources, e.g., the server 140 may perform the processing for executing the predictive model 706.

By way of particular example in which the predictive model 706 may have detected a large reward, either positive or negative, the control model 704 may cause the processor 102 to command the agent 110 to stop and may process the agent's 110 actions. The control model 704 may also be re-trained using a short term predictive model, and a new action planned to react to the reward. This may cause the processor 102 to determine actions of the agent 110 on the fly.

According to examples, the reinforcement learning architecture disclosed herein may allow the perceptive model 710 and the predictive model 706 to predict anticipated outcomes for a short time in the future. In other words, the perceptive model 710 and the predictive model 706 may predict what may happen. There are two ways in which this may be applied. First, the processor 102 may consider what-if scenarios. In some cases, the processor 102 may constantly be running the predictive model 706 a few seconds into the future to anticipate the immediate consequences of the actions of the agent 110. This prediction may be run through the reward policy to predict or anticipate a positive or negative reward. Additionally, the predictive model 706 may include fuzzing using a Monte Carlo simulator, so that the predictive model 706 may predict not just one outcome but multiple likely scenarios. If the predictive model 706 predicts, or anticipates, a positive or negative reward a short time into the future, the processor 102 may pause, and may run more model simulations to establish a likely control policy that will result in a positive reward or avoid a negative reward. In other words, in an example in which the processor 102 determines that the agent 110 is about to bump into an object, the predictive model 706 may detect that action and the processor 102 may re-plan dynamically to avoid the action.
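
The fuzzed what-if rollouts may be sketched as below; the one-dimensional state, step function, and reward threshold are hypothetical simplifications of the predictive model and reward policy.

    import random

    def fuzzed_rollouts(state, step, reward_policy, n=50, horizon=10, noise=0.05):
        # Run the predictive model forward under small input perturbations
        # and collect the cumulative reward of each rollout.
        outcomes = []
        for _ in range(n):
            s, total = state, 0.0
            for _ in range(horizon):
                s = step(s + random.gauss(0.0, noise))   # fuzzed prediction step
                total += reward_policy(s)
            outcomes.append(total)
        return outcomes

    outcomes = fuzzed_rollouts(0.0, lambda s: s + 0.1,
                               lambda s: -1.0 if s > 0.8 else 0.1)
    replan = min(outcomes) < 0.0   # anticipated negative reward: pause and re-plan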

Second, the processor 102 may learn dynamically in the environment 130. This may be enhanced even more if the correct reward policies are applied to the control model 704, in particular to counteract negative behaviors that may be particularly unhelpful. For example, an agent's 110 model may end up exhibiting odd behavior, such that the agent 110 gets stuck in a corner, runs into a wall, extends an actuator into an obstacle, or in the worst case hurts a person or damages property. Correct application of sensor feedback, for example, collision sensors or tactile sensors, may apply a negative feedback loop. In this scenario, the processor 102 may run a series of Monte Carlo simulations using fuzzing of the predictive model 706, and then re-train the control model 704 to optimize the reward policy to extricate the agent 110 from being stuck. In other words, Monte Carlo simulations using the world model may be used for a limited problem solving capability. By running a series of simulated futures, the processor 102 may find a future in which the agent 110 gets unstuck, and then re-train the control model 704 using this data to get out of whatever situation the agent 110 has become stuck in, or in some cases avoid getting into the stuck position in the first place, because the processor 102 may also be running this simulation into the future and may potentially identify a negative outcome before it happens.
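
Searching simulated futures for an escape may be sketched as follows; the grid state, action set, and reward threshold are hypothetical, and the returned plan stands in for the data used to re-train the control model 704.

    import random

    def find_escape(stuck_state, step, reward_policy, tries=200, horizon=20):
        # Sample random action sequences; return the first trajectory whose
        # simulated end state earns a positive reward (the agent gets unstuck).
        actions = ["forward", "back", "left", "right"]
        for _ in range(tries):
            plan = [random.choice(actions) for _ in range(horizon)]
            s = stuck_state
            for a in plan:
                s = step(s, a)
            if reward_policy(s) > 0:
                return plan   # training data for re-training the control model
        return None

    plan = find_escape(
        (0, 0),
        lambda s, a: (s[0] + (a == "forward") - (a == "back"), s[1]),
        lambda s: 1 if s[0] >= 3 else -1)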

The perspective model 702 may allow the processor 102 to make predictions of what the agent 110 may see from another perspective location. Training the perspective model 702 to make these types of predictions may be performed by accumulating sensory input over time and as the agent 110 explores the environment 130. For instance, the agent 110 may accumulate sensory data from many locations in the environment 130. At some point in the future, or the past, the agent 110 may be in another position in the environment 130. This data may be used to train the perspective model 702, which may allow the perceptive model 710 to form an internal map of the environment 130. After accumulating sufficient input, the processor 102 may determine, for example, after turning a certain corner in a maze environment, what was “behind” the agent 110 because the agent 110 had been there previously, and also had previously been through this junction in both directions multiple times. This may be construed as a spatial awareness model in the agent 110. The spatial awareness model may be trained by integrating many input frames over time as the agent 110 explores the environment 130. Also, to optimize training of the perspective model 702, the agent's 110 reward policy may be optimized for curiosity or exploration behavior in the agent 110, so that the agent 110 may explore a new environment to accumulate data to train the perspective model 702, and indirectly the perceptive model 710, which may ultimately encode this data.
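
Querying such a perspective model might be sketched as below, assuming PyTorch; the pose encoding and network shape are hypothetical.

    import torch
    import torch.nn as nn

    STATE, POSE = 128, 3   # pose: (x, y, heading), a simplification

    # Maps the current state vector plus a target pose to the state vector
    # the agent would expect to perceive from that pose.
    perspective = nn.Sequential(
        nn.Linear(STATE + POSE, 256), nn.ReLU(), nn.Linear(256, STATE))

    def view_from(v_now, pose):
        return perspective(torch.cat([v_now, pose], dim=-1))

    # e.g., three feet ahead of the robot, facing back toward it
    v_other = view_from(torch.randn(1, STATE), torch.tensor([[3.0, 0.0, 3.14]]))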

Another way to understand the structure of the predictive model 706 for the processor 102 is as a matryoshka model. In a sense, the predictive model 706 may be construed as three predictive control models nested within one another. The inner model is a basic perceptive/predictive/control model, which may be referred to as a world model. The inner model may have the ability to make predictions, dream, or imagine, and make control decisions based on these predictions, as well as the past history of observations. The second layer is similar, but adds an inner perceptive/predictive/control world model to try to predict the behavior of other actors identified by the perceptive model 710. The second layer may add the perspective to try to predict what the perceptive model 710 would be providing from the other actor's perspective. And, the agent model 708 in this case may use the perceptive/predictive/control world model of the inner model. The third layer is again similar to the second layer, except in this case the inner model, e.g., the agent model 708, may be the complete second layer as just described. The third layer is the actual model that the processor 102 may use, using the second layer inner model as the agent model 708 for the other actors the processor 102 identifies, and in that model, the processor 102 may use a perceptive/predictive/control world model as the agent model 708. As a result, the predictive model 706 may be like a matryoshka doll, e.g., three models wrapped within each other, each making predictions based on what the model learns from the inner model.

As discussed herein, the processor 102 may identify multiple actors 120, 122 in the environment 130, which the processor 102 may collectively encode as an agent vector. The agent vector may be trained to encode a list of actors 120, 122, including a best guess at the policies respectively controlling the actors 120, 122 and their perspectives of the perception vector. The agent vector may be sufficient to predict the behavior in terms of control output for each actor 120, 122 in the agent's 110 environment 130. There may be at any given time multiple actors 120, 122 in the environment 130, encoded using a recurrent model which will output the next actor given a list of previous actors, until the recurrent model reaches the last actor, in which case the recurrent model may output a terminator agent vector. This may be analogous to the way a character based RNN outputs a word as a sequence of characters, with a stop character at the end of the word.
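
That terminator-vector scheme may be sketched as below, assuming PyTorch; the GRU, the dimensions, and the zero-vector terminator are hypothetical choices standing in for the recurrent model described above.

    import torch
    import torch.nn as nn

    AGENT_DIM, MAX_ACTORS = 64, 8
    TERMINATOR = torch.zeros(1, 1, AGENT_DIM)   # all-zeros "stop" vector

    rnn = nn.GRU(input_size=AGENT_DIM, hidden_size=AGENT_DIM, batch_first=True)

    def decode_actors(seed):
        # Emit one agent vector per step until the terminator appears, like a
        # character-based RNN emitting characters until a stop character.
        actors, v, hidden = [], seed, None
        for _ in range(MAX_ACTORS):
            v, hidden = rnn(v, hidden)
            if torch.allclose(v, TERMINATOR, atol=1e-3):
                break
            actors.append(v.squeeze())
        return actors

    actor_vectors = decode_actors(torch.randn(1, 1, AGENT_DIM))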

Each individual agent model 708 may take many different forms, depending on what kind of actor it is trying to model. However, in the simplest case, the agent model may be modeling another actor which is using the same method to model the behavior of all the other actors as well. In this general form, the agent model may include its own perceptive, predictive, control, and agent models. In addition, the second order agent model, i.e., the representation of other agents in the other actors, may contain a perceptive, predictive, and control model, e.g., without an agent model.

The agent model 708 may accommodate multiple model templates for different classes of actors 120, 122 that the agent model 708 may identify. However, this may become a sequence of agent models that may ultimately be accumulated into a uniform autoencoder output vector. For this reason, the output of the individual agent models may be uniform, so that the individual agent models may be combined uniformly. The output may be generic, including predicted actions such as rotation and motion in all directions, including vertical for agents which may be capable of independent vertical movement, and/or the like. The uniform autoencoder output vector may also include predictions of relative motion. In addition, behavior for a short period of time into the future may be predicted, which may either be a fixed sequence of repeated action and motion vectors, or the same vector autoencoded into a smaller vector, that may semantically express ideas, such as, for instance, the actor will move forward three feet, stop, then turn to the left.

According to examples, the agent vector may be an autoencoded expression of the actor's intention, a kind of model specific short term memory, which may be passed into the model for repeated iterations. This may allow the primary model to essentially play forward what the other actors may be planning to do over an indefinite time period.

In some examples, the agent model predictions may not predict a single specific outcome. Instead, each of the agent models 708 may estimate a range of different actions with different levels of confidence, for instance, through implementation of a Monte Carlo simulation. By way of example, the Monte Carlo simulation may simulate small random variations in both the model and the inputs, and a clustering algorithm may then be used to group the outcomes and count the similar outcomes in each cluster. The mean outcome of each cluster may be taken as a representative sample, and the probability of that outcome may be taken as proportional to the counts of each outcome. Alternately, a predictive model 706 may be used that predicts a range of outcomes with different probabilities. In either of these examples, the output from the individual agent models 708 may be a range of possible actions together with the likelihood of each action, in descending order of likelihood.
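
A minimal sketch of that procedure follows, assuming a discrete action space so that grouping identical predicted actions stands in for the clustering step; the toy predictor and noise level are hypothetical.

    import random
    from collections import Counter

    def action_distribution(predict, inputs, n=500, noise=0.05):
        # Monte Carlo over small input perturbations; group identical
        # predicted actions and rank them by relative frequency.
        counts = Counter(predict(inputs + random.gauss(0.0, noise))
                         for _ in range(n))
        total = sum(counts.values())
        return [(action, c / total) for action, c in counts.most_common()]

    # Toy predictor: the actor turns right unless it is close to the wall.
    dist = action_distribution(lambda x: "right" if x < 0.5 else "straight", 0.45)
    print(dist)   # e.g., [("right", 0.84), ("straight", 0.16)]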

In some examples, the agent models 708 may not be limited to actors that are automated. Instead, a similar model may be applied to persons, animals, or non-automated machines. For people or animals, an agent model 708 that may essentially be identical to the agent model for a robot may provide a crude but adequate simulation. In other words, it may not necessarily be important to model people and machines separately. A simplifying assumption may be made that all intelligent agents make decisions with the same simple model.

In some examples, the processor 102 may be connected to a global cloud-based data repository and learning service, which may, for instance, include the server 140. In these examples, observations may be gathered to continuously re-train global perceptive, perspective, predictive, control, and agent models. As these models are re-trained in real time, the processor 102 may update its models to reflect the most current data. The cloud service may also employ the feature of the predictive model 706 that may allow the predictive model 706 to re-play hypothetical scenarios, extrapolate into the future, and use this training data to refine the control models 704. The cloud service may also help the processor 102 fit the most likely agent models 708 to specific objects in the real world. For example, the agent models 708 may find that a general class of objects, for example red cars, behaves in a certain way, for example tending to drive fast and make risky decisions. The agent models 708 may also make generalizations about specific objects, for example the red car which pulls out of a specific driveway each morning at a specific time. The cloud service may learn that that particular vehicle behaves in a specific way, and makes specific decisions. The cloud service may share this information globally, for instance, across all processors that may access the cloud service for the models. The cloud service may also correlate specific data, for example, on a specific day a specific red car might already have driven out of a specific driveway, so that other agents 110 that may pass the same driveway later in the day may make different assumptions about what may occur, specifically that that red car has already left and is not likely to come out of the driveway again that morning. The cloud service may share these observations globally to thereby connect all of the processors, e.g., agents, which connect to the cloud service.

According to examples, certain data collected by the agent 110 may be deleted, censored, or redacted, and/or the models may be trained on that certain data in such a way as to guarantee a certain level of privacy, such as by using differential privacy. This may be optional and configurable, and may also comply with privacy regulations. For example, the processor 102 may be prevented from using facial recognition if such use is either a violation of law or of a person's privacy.

An example of the present disclosure is provided in which the agent 110 is a self-driving car approaching an intersection at which a pedestrian is waiting to cross at a crosswalk. In this example, the agent 110 may stop to allow the pedestrian to cross only if the processor 102 is aware that the pedestrian is aware of the agent 110. In other words, the processor 102 may be able to model the pedestrian's behavior, and identify that the circumstances that are preventing the pedestrian from crossing at the intersection involve the agent 110 itself. That is, the processor 102 may be aware that the agent 110 approaching the intersection may be the reason the pedestrian is waiting to cross. This awareness may allow the processor 102 to make a determination to change the action 132 of the agent 110 to modify the action of another actor. In other words, the processor 102 may determine that the agent 110 is to stop prior to the intersection to allow the pedestrian to cross. This behavior may not arise as an outcome of any other behavioral model that does not include this awareness.

Another way to describe this behavioral pattern is to say the processor 102 may empathize with other actors. More explicitly, the processor 102 may infer what these actors' goals are by creating a machine learning model for the actors' behaviors. A result of this empathy is that the processor 102 may choose to facilitate the other actor in achieving that actor's goals. In this example, the processor 102 may facilitate the pedestrian crossing the street by stopping the agent 110 to allow the pedestrian to pass. The specific mechanism by which the processor 102 may make a decision to facilitate the action of other actors may be programmed into the processor 102, either in the form of a reward mechanism in the processor's 102 control model, or through explicit programming. Such supplemental programming may have been determined at the time the agent 110 was designed. The contrary may also be indicated, where a processor 102 may identify an actor the processor 102 wishes to prevent from achieving the actor's goal, for example, preventing an intruder from breaking and entering in the case of a security robot, or other similar scenario.

An example of the present disclosure is provided in which the agent 110 is a virtual agent that may react to decisions made by non-robotic actors. For instance, the agent 110 may be a virtual teammate in a video game or other simulation. In most simulations, artificial intelligence (AI) controlled actors may all be controlled by a single AI. The controlled actors may all interact with each other because a single program is orchestrating and coordinating the behaviors of the actors. However, it is not possible to predict the behaviors of human actors in the simulation. And, if there are multiple human actors in a multi-user simulation, predicting the actions of all the human actors, how they react to the actions of other human actors, or how they react to the actions of virtual agents, is normally considered a very hard problem.

However, using the models described herein, each virtual agent may create a model to try to identify the goals of each actor, human or virtual. For human actors, the model may try to estimate what the goals are, and then may build a pre-trained model optimized for those goals. However, the pre-trained model used to predict the human actor's behavior may also use a model that may incorporate predictions of other virtual agents as well as other human actors. This model may produce a high fidelity prediction of both human and virtual actors in the simulation. The result may be a virtual agent that may interact with, either cooperatively or adversarially, multiple simultaneous human and virtual actors.

By way of particular example, a video game may include two teams, a red team and a blue team, and each team may have one human actor and one virtual (or AI) actor. In this example, the red team human actor may be about to throw a grenade, and both the virtual actors in the red and blue teams may also model the behaviors of all the other actors, both human and virtual. The red team virtual agent may predict that the red human actor is going to throw a grenade. Simultaneously, however, the red team virtual agent may make a prediction for the reaction of both of the blue team actors. The blue team virtual agent may do the same thing, which may allow the blue team virtual agent to make an action plan based on what the blue team virtual agent has determined the blue team human actor is going to do, as well as a prediction for both red team actors. In this way, the virtual agents for both teams may make high fidelity predictions of all the actors and make action plans to cooperate naturally with their teammates as well as to cooperatively attack the opposing team.

An example of the present disclosure is provided in which the agent 110 is a robot designed to interact with humans in a natural way, e.g., as a helper, a toy, and/or the like. The processor 102 in this example may plan accurate actions for the agent 110 when the processor 102 has made some prediction about what other humans in its environment might be doing. This may be helpful because, for a processor 102, human behavior is difficult to anticipate. As a result, robots may get in the way, get knocked over, or possibly collide with a person. Having a predictive model as disclosed herein to at least guess at what humans may do next may enable the processor 102 to operate accurately in these types of situations, especially in a situation where multiple human and perhaps multiple robot actors are interacting in the same environment.

An example of the present disclosure is provided in which the agent 110 is a robot that may see itself in a mirror or in an image captured by a video camera. In this example, the processor 102 may identify that the agent 110 has seen itself and may model the agent's 110 own behavior. Otherwise, the processor 102 may treat the reflection as some other robot behaving independently. In some examples, the processor 102 may have special programming to detect this case and resolve the paradox. In other examples, the processor 102 may identify the agent 110 in other ways, such as through a built-in image recognition model, and may identify itself visually. As a further example, the agent 110 may be equipped with a light that the agent 110 may blink on demand. In this example, the agent 110 may activate the light in a random pattern such that the processor 102 may know with certainty, by the sequence of flashes, that the agent 110 is looking at itself. In this case, the processor 102 may substitute the control program for the predictive model to assign to the agent 110 it sees. In other words, the processor 102 may identify the agent 110 when the agent 110 sees itself and knows what the agent's 110 own actions will be.

Although described specifically throughout the entirety of the instant disclosure, representative examples of the present disclosure have utility over a wide range of applications, and the above discussion is not intended and should not be construed to be limiting, but is offered as an illustrative discussion of aspects of the disclosure.

What has been described and illustrated herein is an example of the disclosure along with some of its variations. The terms, descriptions and figures used herein are set forth by way of illustration only and are not meant as limitations. Many variations are possible within the scope of the disclosure, which is intended to be defined by the following claims—and their equivalents—in which all terms are meant in their broadest reasonable sense unless otherwise indicated.

The invention claimed is: 1.-20. (canceled)
21. An apparatus comprising: a processor; and a memory on which is stored machine readable instructions that when executed cause the processor to: access data about an environment of an agent; identify a first actor in the environment from the accessed data; access a first model that predicts a first action of the first actor; apply the first model on the accessed data to determine a predicted action of the first actor in the environment; determine an action that the agent is to perform in the environment based on the predicted action of the first actor; and cause the agent to perform the determined action in the environment.
22. The apparatus of claim 21, wherein, to determine the action that the agent is to perform, the instructions, when executed, cause the processor to: implement a recursive reinforcement learning model using the predicted action of the first actor to determine the action that the agent is to perform.
23. The apparatus of claim 21, wherein, to determine the action that the agent is to perform, the instructions, when executed, cause the processor to: determine the action that optimizes a reward for the agent according to a reward policy.
24. The apparatus of claim 21, wherein the instructions, when executed, cause the processor to: access images captured around the agent to access the data about the environment; and apply a machine learning model on the accessed images to identify the first actor in the environment.
25. The apparatus of claim 21, wherein the instructions, when executed, cause the processor to: identify a second actor in the environment from the accessed data; access a second model that predicts a second action of the second actor; apply the second model on the accessed data of the environment to determine a predicted action of the second actor in the environment; and use the predicted action of the second actor as a factor in determining the predicted action of the first actor.
26. The apparatus of claim 21, wherein the instructions, when executed, cause the processor to: identify a second actor in the environment from the accessed data; access a second model that predicts a second action of the second actor; apply the second model of the second actor as a first order reinforcement learning model on the accessed data to determine a predicted action of the second actor in the environment; apply the first model of the first actor as a second order reinforcement learning model on the accessed data and the predicted action of the second actor to determine the predicted action of the first actor in the environment; and apply a third order reinforcement learning model on the accessed data and the predicted action of the first actor to determine the action that the agent is to perform in the environment.
27. The apparatus of claim 26, wherein the instructions, when executed, cause the processor to: upload the determined action that the agent is to perform to a server via a network, the server to update the third order reinforcement learning model.
28. The apparatus of claim 21, wherein the instructions, when executed, cause the processor to: determine a plurality of candidate predicted actions of the first actor through application of the first model on the accessed data; and based on the plurality of candidate predicted actions of the first actor, use a recursive reinforcement learning model to determine the action that the agent is to perform.
29. The apparatus of claim 21, wherein the instructions, when executed, cause the processor to: simulate outcomes of candidate actions of the agent; and select one of the candidate actions that is predicted to result in an optimal result as the action that the agent is to perform.
30. A method comprising: accessing, by a processor, data about an environment of an agent; identifying, by the processor, a first actor in the environment from the accessed data; accessing, by the processor, a first model that predicts a first action of the first actor; applying, by the processor, the first model on the accessed data to determine a predicted action of the first actor in the environment; determining, by the processor, an action that the agent is to perform in the environment based on the predicted action of the first actor; and causing, by the processor, the agent to perform the determined action in the environment.
31. The method of claim 30, wherein determining the action that the agent is to perform includes: implementing a recursive reinforcement learning model using the predicted action of the first actor to determine the action that the agent is to perform.
32. The method of claim 30, wherein determining the action that the agent is to perform includes: determining the action that optimizes a reward for the agent according to a reward policy.
33. The method of claim 30, further comprising: accessing images captured around the agent to access the data about the environment; and applying a machine learning model on the accessed images to identify the first actor in the environment.
34. The method of claim 30, further comprising: identifying a second actor in the environment from the accessed data; accessing a second model that predicts a second action of the second actor; applying the second model on the accessed data to determine a predicted action of the second actor in the environment; and using the predicted action of the second actor as a factor in determining the predicted action of the first actor.
35. The method of claim 30, further comprising: simulating outcomes of candidate actions of the agent; and selecting one of the candidate actions that is predicted to result in an optimal result as the action that the agent is to perform.
36. The method of claim 30, further comprising: identifying a second actor in the environment from the accessed data; applying a second model of the second actor as a first order reinforcement learning model on the accessed data to determine a predicted action of the second actor in the environment; applying the first model of the first actor as a second order reinforcement learning model on the accessed data and the predicted action of the second actor to determine the predicted action of the first actor in the environment; and applying a third order reinforcement learning model on the accessed data and the predicted action of the first actor to determine the action that the agent is to perform in the environment.
37. A non-transitory computer-readable medium on which is stored computer-readable instructions that, when executed by a processor, cause the processor to: access data about an environment of an agent; identify a first actor in the environment from the accessed data; access a first model that predicts a first action of the first actor; apply the first model on the accessed data to determine a predicted action of the first actor in the environment; determine an action that the agent is to perform in the environment based on the predicted action of the first actor; and cause the agent to perform the determined action in the environment.
38. The non-transitory computer-readable medium of claim 37, wherein the instructions, when executed, cause the processor to: access images captured around the agent to access the data about the environment; and apply a machine learning model on the accessed images to identify the first actor in the environment.
39. The non-transitory computer-readable medium of claim 37, wherein the instructions, when executed, cause the processor to: implement a recursive reinforcement learning model using the predicted action of the first actor to determine the action that the agent is to perform.
40. The non-transitory computer-readable medium of claim 39, wherein the instructions, when executed, cause the processor to: identify a second actor in the environment from the accessed data; access a second model that predicts a second action of the second actor; apply the second model as a first order reinforcement learning model on the accessed data to determine a predicted action of the second actor in the environment; apply the first model as a second order reinforcement learning model on the accessed data and the predicted action of the second actor to determine the predicted action of the first actor in the environment; and apply a third order reinforcement learning model on the accessed data and the predicted action of the first actor to determine the action that the agent is to perform in the environment.